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Abstract 

Background: Hidden Markov Models (HMM) are often used for analyzing Comparative Genomic Hybridization 
(CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. 
For efficiency reasons the parameters of a HMM are often estimated with maximum likelihood and a segmentation 
is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be 
avoided with Bayesian approaches integrating out parameters using Markov Chain Monte Carlo (MCMC) sampling. 
While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood based approaches 
are still preferred in practice for their lower running times; datasets coming from high-density arrays and next 
generation sequencing amplify these problems. 

Results: We propose an approximate sampling technique, inspired by compression of discrete sequences in HMM 
computations and by /cd-trees to leverage spatial relations between data points in typical data sets, to speed up 
the MCMC sampling. 

Conclusions: We test our approximate sampling method on simulated and biological ArrayCGH datasets and high- 
density SNP arrays, and demonstrate a speed-up of 10 to 60 respectively 90 while achieving competitive results 
with the state-of-the art Bayesian approaches. 

Availability: An implementation of our method will be made available as part of the open source GHMM library 
from http://ghmm.org. 



Background 

The Sirens' call of Bayesian methods is that we can do 
without the parameters of models and, instead, compute 
probabilities of interest directly, indicating for example 
how likely a biological fact is given our data. The price one 
pays for that convenience is on one hand the conundrum 
of which prior distributions to choose and how to set their 
hyper-parameters; the frequent use of uniform priors is 
evidence that this is indeed non-trivial On the other hand, 
unless the choice of likelihood and prior yields simple pos- 
teriors which we can integrate symbolically, we have to 
resort to sampling for example with Markov Chain Monte 
Carlo (MCMC) [1]. 

In the following we will concentrate on HMMs, stochas- 
tic functions of Markov Chains, which have not only been 
used extensively for discrete sequences-pairwise-sequence 
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alignments with pair- HMMs [2], gene finding with labeled 
HMMs [3], and detecting remote homologs using profile 
HMMs [4] -but also for continuous-valued observations, 
such as gene expression time-courses [5]. Continuous 
observation sequences from either DNA microarrays or 
next generation sequencing experiments, note that the 
proportion of mapped reads in an interval is frequently 
used as a continuous measure of copy number, to detect 
chromosomal aberrations or copy number variations lead 
to the same fundamental computational problem and 
share characteristics of the data. The goal is to segment an 
observation sequence into regions in which there is little 
variation around a common mean. In other words, the 
assumption is that the data can be approximately 
described by piece-wise constant functions. Indeed, if 
hybridization intensity was an exact, un-biased measure- 
ment of DNA concentration before amplification, the 
sequence of hybridization intensities of probes along a 
chromosome would yield a piece-wise constant function 
in ArrayCGH experiments. This assumption holds true for 
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a mixture of different cell populations because a finite sum 
of piece-wise constant functions is again a piece-wise con- 
stant function. 

A wide range of methods for copy number detection in 
ArrayCGH data have been developed in recent years, 
including change-point detection based methods [6,7], 
smoothing based methods [8,9], and hierarchical cluster- 
ing [10]. Here, we concentrate on HMM-based approaches 
which have been proposed for segmenting sequences of 
continuous-valued observations and shown to match or 
improve upon the state-of-the-art [11-13]. Once a model 
is trained from the data, either using maximum likelihood 
(ML) or maximum a posteriori (MAP), the segmentation 
is given by the most likely state sequence obtained with 
the Viterbi algorithm [14]. The segmentation, however, is 
highly dependent on the model parameters. A small 
change in the computed parameter values can adversely 
affect the recovered segmentation. A full Bayesian 
approach alleviates this dependence by integrating out the 
model parameters. As analytic integration of a complex 
high dimensional model is impossible for most distribu- 
tions, the Bayesian approach requires the use of numerical 
integration techniques like MCMC [15], for example by 
direct Gibbs sampling [16] of model parameters and state 
paths. Scott [17] reports that combining the forward-back- 
ward recursions [18] and Gibbs sampling improves the 
converge rate and consequently the running time. Never- 
theless, MCMC remains substantially slower than training 
one model and running Viterbi once and the loss in relia- 
bility introduced by relying on one ML or MAP model is 
ignored in practice. For discrete emissions, compressing 
sequences and computing forward and backward variables 
and Viterbi paths on the compressed sequences yields 
impressive speed-ups [19]. However, discretization of con- 
tinuous emissions, similar to vector quantization used in 



speech recognition [18], is not viable as the separation 
between the different classes of observations is low since 
the observations are one-dimensional. Moreover, maximal 
compression is to be expected for small number of dis- 
crete symbols and clearly compression ratio conflicts with 
fidelity in the analysis. 

For a different task, arguments about spatial relations 
between groups of multi-variate data points were used to 
achieve considerable speed-up. Moore and colleagues 
used modified /cd-trees, a data structure to efficiently exe- 
cute spatial queries such as determining the nearest 
neighbor of a given point, to accelerate /c-means [20]. 
The fundamental insight is illustrated in Figure 1 (left). 
In the reassignment step of /c-means one has to find the 
nearest centroid for every data point. Due to the kd-tree, 
there are groups of points contained in a node of the tree 
for which this decision about the nearest centroid can be 
made simultaneously by a geometrical argument about 
the vertices of the hyperrectangle defined by this node. A 
similar kd-tree based approach was used in speech recog- 
nition [21,22] to quickly find the most important compo- 
nents in a mixture of large number of Gaussians and 
thus approximate the full observation density in one indi- 
vidual HMM state with multi-variate emissions. 

At the core of our approach is a similar geometrical 
argument about several uni-variate data points based on 
a modified kd-tree. We adaptively identify blocks of 
observations, cf. Figure 1 (middle). For all observations 
in a block we now estimate, at least conceptually, the 
most likely state simultaneously depending on the 
means of the Gaussians in each state to gain a consider- 
able speed-up proportional to the average block length. 
Similarly, we can avoid sampling states for each indivi- 
dual observation in a block if we can bound the poster- 
ior. Considerable care has to be taken for combining 
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Figure 1 Fundamental insight. When reassigning two-dimensional data points to cluster centroids c and d in /c-means clustering (left) the 
hyperrectangles obtained from /d-trees reduce the computational effort by making an argument about oil points in an hyperrectangle based on 
their vertices; consider for example the rightmost hyperrectangle. For sequences (middle) there is no overlap in y-direction and decisions about 
the most likely state can be made per block considering the means of the Gaussians of a three-state HMM (right), jj., \j= and fj+. Note that at 
every given block only a decision between the two states with closest mean is necessary, if one assume comparable variances. Decision 
boundaries are displayed dashed. 
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blocks and to bound the errors introduced by the 
approximations based on geometric arguments. 
In summary, we 

♦ propose the first use of spatial index structures for 
several consecutive observations and approximate 
computations based on geometric arguments to sub- 
stantially speed-up the problem of segmenting 
sequences of continuous observations using HMM, 

♦ demonstrate that very frequently used biological 
datasets of high relevance measuring chromosomal 
aberration and copy number variations are consis- 
tent with our assumptions of piece-wise constant 
observations with added noise, and 

♦ achieve speed-ups between 10 and 90 on those 
biological datasets while maintaining competitive 
performance compared to the state-of-the-art. 

Methods 

HMM 

We only consider HMMs with Gaussian emission distri- 
butions; see Figure 1 (right) for an example and [18] for 
an introduction. We will use the following notation: N 
denotes the number of states, S = {s lf s 2 , s N } the set of 
states, T the length of an observation sequence O = {o lf 
o 2 , o T } with Of g R, A = {a<ij}i<i,j<N the transition 
matrix, tt = tt 2 , tin) the initial distribution over 
states, B = {(/xi, of) , . . . , (/Zjv, o^)} with \a x < ... < p N are 
parameters of the emission distributions, and Q = {qi, q 2 , 
q T } the hidden state sequences with q t e S. From an 
observation sequence O we can obtain a maximum likeli- 
hood point estimate of the parameters {A, B, tt) using the 
Expectation-Maximization (EM) or Baum-Welch [23] 
algorithm and compute the most likely path using the 
Viterbi algorithm [14]. 

MCMC Sampling 

Bayesian analysis requires choosing prior distributions 
on parameters and we use standard conjugate prior dis- 
tributions following [1]. Specifically, we choose 
fii ~ N(/Xj, <jj), cr~ 2 ~ Gamma(ai, bi),Ai ~ Dirichlet(X Ai ) , and 
tt ~ Dirichlet(X n ) } where jX it a it ai, hi, X Ai , and X n are 
the hyperparameters of the model. 

It is only possible in some trivial cases to compute 
posterior distribution in closed form using analytic inte- 
gration. In most applications, specially for high dimen- 
sional distributions, Monte Carlo integration techniques, 
like MCMC sampling by Gibbs sampling or Metropolis- 
Hastings, are easier to compute and generally produce 
more accurate results than numerical integration [15]. 
Scott [17] compares various MCMC approaches and 
strongly argues in favor of forward-backward Gibbs 
sampling (FBG-sampling), which has been successfully 



used by others [24,25], for it's excellent convergence 
characteristics. We briefly summarize FBG-sampling for 
an HMM X = (A, B, tt); see [17,26] for details: 

1. Choose initial parameters 6° = (A 0 , B°, tt°) 

2. Perform the following steps for iteration 0 < m <M. 

(a) Compute forward variables P{q t = i, ^ 
t \6 m ) for 1 < t < T iteratively using the forward 
algorithm [18] for all L 

(b) Sample q™ ~ P(q T ,O\0 m ) . 

(c) Sample the state sequence Q m in a backward 
fashion for T >t > 1. 

ocP(COi t\e m )a qT<i . 

(d) Sample, 

<9 m+1 ~PriorDistribution(H, O, Q m , 0 m ) 
[H = Set of hyperparameters]. 

Despite its fast convergence, FBG-sampling takes O 
(MTN 2 ) time for M iterations. For long observation 
sequences with millions of observations, as they are 
common in biological applications, realistic values for M 
and N make FGB-sampling intractable. In the next sec- 
tion we discuss how FBG-sampling can be approximated 
to improve the running time to 0(yMTN 2 ), where y is 
the compression ratio of the observation sequence, 
while maintaining good statistical properties. We refer 
to our sampling technique as approximate sampling. 

Approximate sampling 

Through application of a modified kd-tree algorithm 
(details below), we compress the observation sequence 
O = o lt o T into O' = o' v . . . , o' v , cf. Figure 1 (middle), 
and store precomputed first moment, second moment, 
and the number of observations compressed into block 
o[ for all i < T\ In subsequent MCMC iterations we 
assume that observations compressed in a block o[ arise 
from the same underlying state. In other words we 
ignore the contribution of the state paths that do not go 
through the same state for o' v By ignoring those state 
paths, we refer to them as weak state paths, when com- 
puting forward-variables and by reusing pre-computed 
statistics we are able to accelerate sampling. 

At first ignoring weak state paths may sound like a very 
crude approximation for computing forward variables. But 
in many applications this is certainly not true. We demon- 
strate with a symmetric Gaussian HMM that the weak 
state path assumption is a fairly realistic approximation 
and leads to faster sampling. We define a symmetric 
HMM X = (A, B, tt) with N states s lf s N , where we set 
self-transition probability a u = t and non-self-transition 
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probability ciy = — — - for 1 < / * / < N, and B = {(ft lt cr), 

(pL N , a 2 )}. Given a sequence of observations O (assumed 
to be generated by X) and its compressed form O' we 
describe an important lemma and some remarks below. 

Lemma 1 . Let = 01 = 0i = ™<c, = <™,d= - «i 

, pfaio 1 '- 1 ) 

and — A— < a . Assuming there exists a state 

. ( , + fl s IJL S + /JL S \ 

S x s.t. r = min^o mm , o max )>0, we 

can show that ; ; - — - - - 0/|O1 _ 1} < ««i + re)- 1 + ( n - i) C f (i + rr 1 ), 

where r = i—^ and c = e ~j^> 
Proof. See Appendix. 

Remark 1 For realistic values of r, £, and the contribu- 
tion from ignored weak state paths, which we call e, can be 
very small. If e « 1, ignoring weak state paths will not 
introduce large errors in the computation. For the 2-state 
example in Section: Results, where t = 0.9, d = 1, and a 2 = 

0.1, e is at most ^ for a block length n < 10 if we assume 

r > 0.25 and a = 1. If r is much larger and consequently 
c jj- is much smaller, we can roughly say that n can be as 
large as 1 + logi +rc (l + e) in a symmetric Gaussian HMM. 

Remark 2 We often encounter situations where P[q t = 
SxlO 1 ' 1 ) » P{q t * s x \ O 1 ' 1 ). Even though it is not exploited 
in the lemma (a being greater than or equal to 1), as a 
consequence of this, the observation sequence can be 
compressed into larger blocks keeping e small in practice. 

From the above lemma and the remarks we see that 
the longer blocks created by an algorithm should con- 
centrate around the means of the Gaussian distributions. 
While a brute force algorithm looks at local information, 
a kd-tree like algorithm alternately looks at both dimen- 
sions and utilizes global information like the density of 
data points (around the means data concentration 
should be high) to create better quality blocks. We use a 
modified kd-tree based algorithm to find such blocks 
and discuss the details below. 
kd-tree Based Sequence Compression 
Given a starting width parameter w we create a list of 
nodes from the observation sequence O = o lf o T 
using the following steps. 

1. Let O' = q> be the starting list, 3 = 1.25 (picked 
empirically), level L = 1, and dimension d - 1. 

w 

2. If |max(oi) - max(oi)| < — or IOI = 1, create a 

OiEO OiEO 0 L 

node storing the first and second moment of the 
observations in O, append it to 0\ and then go to 
the end step. Otherwise, go to the next step. 



3. If d = 1, find o m , the median value of the observa- 
tions in O. Partition O into maximal set of consecutive 
observations O 1 , 0\ Cf such that V 0GO i0 < o m or 
VoeO i0 ^ °m- For each 0\ go to step 2 with new input 
set 0\ level L + 1, and d = 0. 

4. If d - 0, divide the input set O into two parts O l = 
o lf Oi and O r = o i+1 , 0\ o \ such that 
\Oi-o i+l \ >rnax|o ; -o j+ i| xhen for each set Q i 

and O r , go to step 2 keeping the level value L 
unchanged, and setting d = 1. 

5. End step. 

In the above recursive algorithm, w states the initial 
width, S controls the rate of width shrinking in successive 
levels of the iterations, and O' accumulates the com- 
pressed blocks of observations. The current iteration 
level L, the current dimension d, and the current input 
set O are local variables in the recursive algorithm. 
Notice that we start with an empty list O' and at the end 
of the recursive procedure O' contains an ordered list of 
compressed observations. To gain further compression of 
the sequence, we sequentially go through the blocks of O' 
and combine consecutive blocks if the distance between 
their means is less than w. We also combine three conse- 
cutive blocks if the outer blocks satisfy this condition and 
the inner block has only one observation. In step 3 of the 
above algorithm, the input set is divided into two subsets 
and each subset contains half of the elements from the 
original set. Consequently, the height of the recursion 
tree is at most 2log T and the running time of the above 
algorithm is 0(Tlog T). This overhead is negligible com- 
pared to the time that it takes to run M iterations of 
MCMC sampling. 
Width Parameter Selection 

For increasing values of w the average block size 
increases exponentially in the above kd-tree based com- 

V 

pression. As a result, the compression ratio y = — 

plotted as a function of w, has a knee which can inform 
the choice of w. Moreover, methods originally developed 
to find the optimal numbers of clusters in clustering can 
be used to find the knee of such a curve automatically. 
In particular, we use the L-method [27] which finds the 
knee as the intersection of two straight lines fitted to 
the compression curve. 
Fast Approximate Sampling 

Given the compressed input sequence O' = d v o f 2 , . . . , o' v 
computing forward variables and subsequent sampling is 
a straightforward modification of the uncompressed 
case. In particular we make the following two changes 
to the FBG-sampling algorithm. 
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1. Modified forward variables at positions t* = J] |o-| 

i=i 

are computed using the following formula, 

%=io; t \o) 

= P fe*-Ki =j,0[ t _ x \e)a ]X 

i<i<N 

af~ l Y\P(.0k\lH,Oi). 

constant time computation 
using precomputed statistics 

2. After sampling the state sequence, parameters are 
updated ignoring non-self transitions between conse- 
cutive observations in d v 

Clearly, each iteration of approximate sampling takes 

2 T 
0(T N ) resulting in — times speed up. 

Results 

We evaluate FBG-sampling and approximate sampling 
in three different settings. First, its effectiveness is veri- 
fied for a simple two state model. Then, we test on 
simulated ArrayCGH data which is the accepted stan- 
dard for method evaluation [28]. Finally, we report find- 
ings from an analysis of Mantle Cell Lymphoma (MCL) 
cell lines [29], Corriel cell lines [30], GBM datasets [31], 
and high resolution SNP arrays [13,32]. For biological 
data, if multiple chromosomes are present, we use pool- 
ing [25] across chromosomes, which does not allow 
transition between different chromosomes but assumes 
model parameters to be identical across chromosomes. 
Throughout this section we define o D to be the standard 
deviation of all observations in the dataset. We com- 
press the dataset with increasing values of w = 0.25(7^, 
0.5(7/), 0.75(7/), .... For evaluation we consider the experi- 
ments as two class problems: aberrant clones belong to 
the positive class and normal clones belong to the nega- 
tive class. When ground truth labels of a dataset are 
available we report Fl -measure, recall, and precision for 
the experiment. With tp,fp, tn, fn we denote the num- 
ber of true and false positives and true and false nega- 

tp 

tives respectively. Recall is defined as 7-, precision 



as 



tp 



and Fl-measure as 



tp +fn J 
2 x recall x precision 



tp+fp' * recall + precision 

Experiments were run with a Python implementation on 
a Linux machine with 1.6 GHz Intel Core 2 Duo proces- 
sor and 2 GB memory. For Expectation Maximization 
(EM), we use the Baum-Welch algorithm from the 
GHMM package which is implemented in C and consid- 
erably faster than a Python implementation. 



Synthetic Data 
2-State HMM 

We define a HMM X 2S t = (A, B, n) with 

A = [[o.9,o.i],[o.i,o.9]],B = [(o,o.i),(i,o.i)],7r = ^,^j. From X 2ST we 

sample an observation sequence O = o 1} o 10)000 , and 
run MCMC for M = 100 steps with hyperparameter 
values /xi;2 = 0, 1 for the prior mean on 
tT 1:2 = 0.5,0.5 for the prior variance on ft, a 1:2 = 4, 4 
for the shape of Gamma prior on a" 2 , b xa = 1, 1 for the 
rate of Gamma prior on a 2 , S n = 1, 1 for the Dirichlet 
prior on the initial distribution 7r, and 8±! 2 = 1, 1 for the 
Dirichlet prior on row i of transition matrix A. 

After M iterations, we compare the posterior prob- 
abilities P{q t = i\O,0^ BG ) and P{q t = i\O r 0^) , where 
0^ G and 0^ are M-th parameter samples of FBG- 
sampling and approximate sampling. Figure 2 shows 
that the posterior probability of being in state 1 for 
each position can be approximated fairly well even for 
large values of w. The average posterior error 

p = ^EtEiin* = i\e M , o) - P(<? t = o)\ 

reflects the same fact in Table 1. Similarly, we com- 
pute the Viterbi paths and report total number of mis- 
matches between them along with the likelihoods in 
Table 1. 

Simulation from Genetic Template 

We use 500 simulated datasets published in [28]. Each 
dataset has 20 chromosomes and 100 clones per chro- 
mosome for a total of 2,000 clones per dataset. A four- 
state HMM predicts the aberrant regions-loss defined 
as state Si and gain defined as state S 3 or S 4 . The neu- 
tral region is modeled as state S 2 . We put an ordering 
constraint on the means, [i x <^ 2 <^ 3 <^ 4 , to prevent 
label switching of the states [17]. Hyperparameter 
choices follow [25] and are jl\A = —0.5,0,0.58,1 for 
the prior mean on ^, a 1:4 = 0.5,0.001, 1.0, 1.0 for the 
prior variance on a VA = 10,100, 5, 5 for the shape of 
gamma prior on O"" 2 , and f? 1:4 = 8 n = 8^! 4 = 1, 1, 1, 1 for 
the rate of gamma prior on a" 2 , the Dirichlet prior on 
initial distribution and the Dirichlet prior on row i of 
transition matrix A, respectively. 

Table 2 shows the mean and standard deviation of Fl- 
measure, recall, and precision over the 500 datasets for 
FBG-sampling, approximate sampling, and Expectation 
Maximization (EM) with the ground truth provided by 
[28]. Even for this collection of relatively small datasets 
we see a 10-fold speed up. For each dataset we run FBG 
and approximate sampling for M = 100 steps (we have 
visually monitored the parameters and noticed conver- 
gence within 50 steps, see Figure 3 for a representative 
example). The last 10 samples are used to compute 10 
samples of the posteriors for each state and for each 
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Figure 2 Simulated data: approximate posterior. We show the posterior probability of state 1 (y-axis) for first fifty observations (x-axis) with 
w = 0.5o~ D (top left), 1.0o~ D (top right), 1.5o~ D (bottom left), and 2.0o~ D (bottom right). The true posterior is shown as a solid line, the approximate 
posterior as a dashed line, and their absolute difference is shown in dashed vertical lines. 



position in the observation sequence. Subsequently, 
aberrant regions are predicted based on the average of 
those distributions. We report the speed-up of approxi- 
mate vs. FBG sampling based on the time it takes to 
compress the sequence and run M steps of MCMC. For 
one individual dataset EM requires 58 seconds on aver- 
age, which allows for a total of 800-1000 repetitions 
from randomized points sampled from the prior distri- 
butions in the time needed for FBG sampling. Each run 
continues until the likelihood converges and the best 
model based on likelihood is selected. Aberrant regions 
are predicted and compared against the ground truth 
based on the Viterbi path. We report the mean and 
standard deviation of Fl -measure, recall, and precision 
over the results of EM on 500 datasets. 



Biological Data 

Mantle Cell Lymphoma (MCL) 

De Leeuw and colleagues identified recurrent variations 
across cell lines using ArrayCGH data of MCL cell lines 
[29]. Out of the eight cell lines [29] HBL-2 was fully 
annotated with marked gain and loss regions in the 
autosomes. This dataset contains about 30,000 data 
points (combining all the autosomes). We have used a 
four-state HMM for predicting aberrant regions. State 1 
represents copy number loss, state 2 represents normal 
copy number, state 3 represents single copy gain, and 
state 4 multiple gain. For HBL-2 we report the Fl -mea- 
sure, recall, precision and speed-up. Similar to the syn- 
thetic case we put an ordering constraint on the means, 
fti <{t 2 <f*3 <f*4< Hyperparameter choices follow [25] and 



Table 1 We show the average posterior error p= ^T,tEi\n<it = i\e M ,o)~p(q t = i\e t ™,o)\ and total number of mismatches between 
the two Viterbi paths y, generated by models with parameters (f rue and (f 1 



Method 


iv (in g d ) 


P 


V 


Likelihood 


Time(in sec) 


Speed up 




0.25 


0.001 


3 


-5470 


74 


1.2 




0.50 


0.001 


3 


-5475 


61 


1.4 




0.75 


0.002 


6 


-5469 


35 


2.4 


Approx 


1.00 


0.004 


22 


-5478 


21 


4.2 




1.25 


0.012 


81 


-5588 


13 


6.5 




1.50 


0.054 


410 


-6576 


8 


10.4 




1.75 


0.219 


2345 


-8230 


4 


20.1 




2.00 


0.248 


2857 


-8492 


3 


34.1 


FBG 




0.003 


12 


-5471 


87 


1.0 



True 



-5470 
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Table 2 EM, FBG-sampling, and approximate sampling results for simulated, HBL-2, and Corriel dataset are shown 
here 



Dataset 


Method 


w 


F1 -measure 


Recall 


Precision 


Time 


Compression 


Speed-up 


Likelihood 






0.50 


0.97 ± 0.04 


0.96 ± 0.07 


0.98 ± 0.02 


27 


0.387 


2.2 








0.75 


0.97 ± 0.04 


0.96 ± 0.06 


0.98 ± 0.03 


16 


0.195 


3.7 








1.00 


0.97 ± 0.05 


0.95 ± 0.07 


0.98 ± 0.03 


10 


0.097 


5.9 






Approx 


7.25 


0.96 ± 0.06 


0.94 ± 0.09 


0.98 ± 0.03 


7 


0.050 


8.8 




Simulated 




1.50 


0.94 ± 0.09 


0.92 ± 0.12 


0.97 ± 0.07 


5 


0.031 


11.3 








1.75 


0.91 ± 0.15 


0.89 ± 0.18 


0.96 ± 0.12 


5 


0.023 


12.2 








2.00 


0.86 ±0.19 


0.85 ± 0.21 


0.92 ± 0.19 


5 


0.018 


12.2 






FBG 




0.97 ± 0.04 


0.96 ± 0.05 


0.98 ± 0.03 


58 




1.0 






EM prior, ML 




0.96 ± 0.09 


0.97 ± 0.04 


0.96 ± 0.11 


58 












1.0 


0.85 ± 0.00 


0.83 ± 0.00 


0.88 ± 0.00 


72 


0.078 


11.3 








2.0 


0.87 ± 0.00 


0.83 ± 0.00 


0.91 ± 0.00 


21 


0.018 


39.3 








3.0 


0.89 ± 0.00 


0.83 ± 0.00 


0.95 ± 0.00 


13 


0.006 


61.8 






Approx 


4.0 


0.84 ± 0.08 


0.77 ± 0.1 1 


0.95 ± 0.01 


13 


0.003 


61.9 








5.0 


0.71 ± 0.17 


0.60 ± 0.22 


0.95 ± 0.01 


13 


0.002 


64.8 








6.0 


0.79 ± 0.07 


0.69 ± 0.10 


0.96 ± 0.01 


14 


0.002 


59.3 








7.0 


O./o ± 0.08 


t~\ r a i a 1 1 

0.64 ±0.11 


0.93 ± 0.01 


1 3 


0.001 


61 .4 




HBL-2 


FBG 




0.82 ± 0.00 


r\ o a i r\ r\r\ 

0.84 ± 0.00 


C\ OA l r\ t~\t~\ 

0.80 ± 0.00 


810 




1 .0 






EM prior, ML 




0.65 


0.90 


0.50 


81 0 






1 51 58 




EM prior, best 




0.85 


0.84 


0.86 


810 






9616 




EM prior, mean 




0.76 ± 0.09 


0.86 ± 0.03 


0.68 ±0.12 


810 






13744 




EM unit, ML 




0.64 


0.90 


0.50 


810 






15159 




EM unit, best 




0.86 


0.84 


0.88 


810 






9136 




EM unit, mean 




0.54 ± 0.24 


0.77 ± 0.21 


0.46 ± 0.27 


810 






13457 


GM05296 


Approx 


2.0 


0.96 ± 0.00 


1.00 ± 0.00 


0.93 ± 0.01 


3 


0.027 


13.0 






FBG 




0.89 ± 0.01 


1.00 ± 0.00 


0.81 ± 0.01 


40 




1.0 




GM00143 


Approx 


2.0 


0.98 ± 0.00 


1.00 ± 0.00 


0.96 ± 0.00 


3 


0.027 


13.8 






FBG 




0.89 ± 0.24 


1.00 ± 0.00 


0.86 ± 0.26 


40 




1.0 





Approximate sampling results are reported for different choices of w. The w value which is closest to the one estimated by the L-method is shown in italic. 
Width w is reported in a D of the corresponding dataset, time is reported in seconds, and compression is defined as — For HBL-2, the initial parameter values 
for EM are sampled from the prior or uniform distributions, and the average (mean), most likely (ML), and best (in terftis of F1 -measure) performances along with 
likelihoods are reported. 



are same as for the simulation from genetic template, 
except for cr 1:4 = 0.2,0.1,0.2,0.2, the prior variance on 
{i, and a VA = 15, 20, 10, 10, the shape of gamma prior 
on a' 2 . Settings for FBG-sampling and approximate 
sampling are identical to the simulated case with one 
exception; for each simulated dataset sampling methods 
run once and we report the average and standard devia- 
tion over 500 datasets, but for HBL-2 we let them run 
10 times and report the average and standard deviation 
of these 10 Fl -measures, recalls, and precisions in Table 
2. Each EM run starts with the initial parameter values 
sampled either from the prior distributions, or from uni- 
form distributions, and continues until the likelihood 
value converges. We report the performance of the 
most likely model (which is the preferred criteria to 
select a model), the likelihood of the best model based 
on Fl -measure, and the average and standard deviation 
of Fl -measures, recalls, and precisions of all the models 



generated by EM. As representative examples, we also 
plot the segmentation of chromosome 1 and 9 com- 
puted by FBG-sampling and approximate sampling 
along with the ground truth labels in Figure 4. 
Corriel 

Corriel cell lines were used by Snijders et al. [30] and 
are widely considered a gold standard in ArrayCGH 
data analysis. This dataset is smaller and, in fact, fairly 
easy compared to the MCL cell lines. For the Corriel 
cell line we use a 4-state HMM and report the results 
for GM05296 and GM00143 in Table 2. Again, approxi- 
mate sampling performs competitively while achieving 
more than a 10-fold speed-up. Hyperparameter choices 
follow [24]. 
GBM 

The glioma data from Bredel et al. [31] has previously 
been used to analyze the performance of CNV detection 
methods [9,33]. According to [33], GBM datasets are noisy 
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Figure 3 MCMC convergence. The convergence of posterior probabilities for loss, neutral, and gain of three representative probes-probe 1658, 
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show the outcomes of approximate sampling. Note that the reduction in standard deviation suggests that approximate sampling converges 
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but contains a mixture of aberrant regions with different 
width and amplitude. In particular, chromosome 13 of 
GBM31 is reported to have low amplitude loss in p-arm 
and chromosome 7 of GBM29 is reported to have high 
amplitude gains near the EGFR locus by previous studies 
[9,33]. The segmentation of these two chromosomes are 
shown in Figure 5. Although [33] reports that EM based 
HMM failed to detect these aberrations we see that Baye- 
sian HMM has successfully detected both the gain in 



chromosome 7 and the loss in chromosome 13. For this 
dataset, we use a 3-state HMM with non-informative 

hyperparameters, H13 = ——, 0, — for the prior mean 

on ft, cr 1:3 = 0.2, 0.1, 0.2 for the prior variance on ft, 
111 

01:3 = — / — / — for the shape of gamma prior on a , 

°b °b °b 

S n = 1, 9, 1 for the Dirichlet prior on initial distribution n, 
and bi-3 = 8^! 3 = 1, 1, 1 for the rate of gamma prior on a" 2 
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and the Dirichlet prior on row i of transition matrix A, 
respectively, and at the recommended w value we see a 10 
fold speed-up. 
SNP Array 

High-resolution Single Nucleotide Polymorphism (SNP) 
arrays are capable of detecting smaller CNVs than 
ArrayCGH. To demonstrate the computational advan- 
tage of approximate sampling on SNP arrays we have 
chosen publicly available Affymetrix 100 k pancreatic 
cancer datasets from [32] and Illumina HumanHap550 
arrays of HapMap individuals which are provided as 
examples in PennCNV [13]. An Affymetrix 100 k 



dataset consists of two different arrays each with « 60, 
000 SNP markers and, in total, 10 5 data points per sam- 
ple. On the other hand, the Illumina HumanHap550 
array has around half a million SNP markers. We have 
applied FBG-sampling and approximate sampling with 
w = 1.8(7/), the recommended value by the L-method, to 
the sample datasets from Harada et al [32] and found 
that the computational speed-up is 22-fold (100 runs of 
FBG-sampling takes 1844 seconds). Both sampling 
approaches mostly agree on their predictions but they, 
specially FBG-sampling, detect several more CNVs than 
previously identified [32]. For example, the high 
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amplification in chromosome 11 (sample 33) is success- 
fully identified by all methods but in chromosome 18 
(sample 16) the sampling algorithms find a few normal 
regions previously predicted [32] as loss using the 
CNAG tool [34] (see Figure 6). One possible reason for 
these differences is that while we use 269 HapMap sam- 
ples as reference they use 12 unpublished normal sam- 
ples as reference. Similarly, we have tested our method 
with 2.0<J D < w < 3.0(7/) against Illumina HumanHap 
samples and observed 70 to 90 fold speed-up in compu- 
tations (100 runs of FBG-sampling takes 9693 seconds). 
These samples are taken from apparently healthy indivi- 
duals and contain very few CNVs. As expected, both 



sampling algorithms' predictions are nearly identical and 
they seem to predict extreme outliers as aberrant mar- 
kers. In contrast, PennCNV [13] does not report CNVs 
which are covered by less than 3 SNPs, thus suppressing 
the outliers as normal. We plot a typical region (from 
lAe + 08bp to 1.7e + 08bp) of chromosome 6 from 
sample 3 (ID 99HI0700A) in Figure 7. 

To set hyperparameters we follow the default para- 
meters of the HMM used in PennCNV [13]. We have 
observed that HMMs for large arrays are particularly 
sensitive to the self-transition probabilities (which is also 
reflected in the default parameter values of the HMM 
used in PennCNV). Hence, hyperparameters were set to 
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gain (blue) clones are identified using FBG-sampling and approximate sampling. For approximate sampling w = 1.8o~ D is used, which was 
recommended by the L-method. 



reflect the choice of high self-transition probability for 
each state-we set 8^! 3 = ctij, cti 2 l, ctij, the Dirichlet prior 
on row i of transition matrix A, where / = 5000, an is 

0.99 for i = 2, 0.95 for i * 2, aij = 1 for i * j. 

Other hyperparameters for the 3-state HMM were set 
such that the expected values of prior distributions 
match the default values for PennCNV. In particular, 
they were jX\^ = 10.66, 0, 0.54 for the prior mean on ^, 
cr 1:3 = 0.001,0.001,0.001 for the prior variance on ft, 
ai :3 = 12, 30, 25 for the shape of gamma prior on o~~ 2 , 
b 1:3 = 1, 1, 1 for the rate of gamma prior on <7~ 2 , and 



S n = 1, 9, 1 for the Dirichlet prior on initial distribution 
ji, respectively. 

Discussion 
EM vs. MCMC 

As already a 4-state Gaussian HMM has 23 free para- 
meters applying EM is often difficult due to the exis- 
tence of multiple local optima and the local 
convergence of EM. Consequently, the estimate has to 
be repeated many times with randomly initialized para- 
meter values to find the most likely model. It should 
also be noted that not necessarily the model maximizing 
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the likelihood exhibits the best performance in classify- 
ing aberrations 2. Additionally, applying any constraint 
in an EM settings, i.e. ordering of the state means, is 
harder than in MCMC. EM also produces inferior 
results on datasets exhibiting class imbalance, for exam- 
ple where one type of aberrations (observations for one 
HMM state) are rare or missing, while MCMC can 
overcome this problem using informative priors. In 
Table 2 we see that MCMC sampling performs better 
than EM on real biological data which is consistent with 
prior reports from Guha [24] and Shah [25] who also 
report difficulties with EM and better MCMC perfor- 
mances. In particular, for HBL-2 we observe that the 
best model in terms of Fl-measure-which is not avail- 
able in de novo analysis-is not the most likely model 
and the most likely model has very low precision and, 
consequently, worse Fl-measure than MCMC sampling. 
On the simulated datasets, EM performs poorly if the 
dataset is imbalanced. As a result we observe slightly 
worse standard deviation for the precisions and Fl -mea- 
sures computed by EM in Table 2. We also notice from 
the confusion matrix of three classes-loss, neutral, and 
gain- that even though the mean Fl-measure, recall, and 
precision (defined on two classes, neutral and aberrant) 
are high, due to label switching [17], EM does not dis- 
tinguish gain from loss, and vice versa, very well (see 
Table 3). By re-ordering the already learned state means 
the label switching problem can be addressed, but that 
increases misclassification rate due to state collapsing as 
the parameter values, specially means of the Gaussians, 
become almost identical [17]. In contrast, Bayesian 
methods cope with class imbalance problem by applying 
order constraints. Moreover, using McNemars test [35] 



on the combined result of the 500 datasets we have veri- 
fied that the predictions based on EM are significantly 
different from the predictions made by Bayesian meth- 
ods with p-values being less than 0.001 in both cases. 
FBG vs. Approximate Sampling 

In an ideal setting, like the 2-state HMM example, 
approximate sampling closely mimics the performance 
of FBG sampling up to moderate compression level. For 
the simulated and real dataset approximate sampling's 
performance is comparable to FBG's while achieving a 
speed-up of 10 or larger. We also observe that for 
higher compression levels approximate sampling reports 
small number of aberrant clones, which results in small 
tp and fp values, but large fn value. As a result, we see 
low recall and high precision rate when the compression 



Table 3 Confusion matrices showing the proportion of 
accurate predictions based on EM, FBG-sampling, and 
approximate sampling methods on 500 simulated 
datasets 







Loss 


Truth 
Neutral 


Gain 




Loss 


0.855 


0.071 


0.074 


EM 


Neutral 


0.001 


0.996 


0.003 




Gain 


0.190 


0.087 


0.723 




Loss 


0.980 


0.020 


0.000 


FBG 


Neutral 


0.002 


0.995 


0.003 




Gain 


0.002 


0.020 


0.973 




Loss 


0.981 


0.019 


0.000 


Approx. {w = 1 .25<7 D ) 


Neutral 


0.002 


0.993 


0.005 




Gain 


0.009 


0.022 


0.969 
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level is too high for a particular dataset (see the rows 
with w > 4.0ob for HBL-2 in Table 2). 

From Figures 4, 5, 6, and 7 we observe that segmenta- 
tions by both sampling methods are almost identical at 
the recommended width w value. In case of HBL-2, they 
differ from the ground truth in some places. They pre- 
dict a few extra gain regions and outliers are generally 
predicted as gains. We, as well as Shah et al. [25], have 
noticed that the HBL-2 dataset has many outliers, and 
the variance of emission distribution of gain state 4 con- 
verges to a high value which tries to explain the outliers. 
In contrast, GBM has fewer outliers (see Figure 5) and 
approximate sampling seems robust to those outliers. As 
the compression algorithm forces possible outliers to be 
included in a compressed block, it is robust to moderate 
frequencies of outliers. 
Width Parameter 

By varying the width parameter w we can control the com- 
pression ratio y and the speed-up achieved by approximate 
sampling. But from Table 1 and 2, and Lemma 1 it is also 
clear that by setting a large value one can get unfavorable 
results. We have analyzed the effect of different w values 
using a synthetic dataset with a controlled level of noise 
following [36] . Each dataset has three chromosomes with 
total probe counts 500, 750, and 1000. Ten aberrant 
regions of size 11-20 probes, randomly assigned gain or 
loss, are inserted in random positions of the 500 probe 
chromosome. Similarly, 15 aberrant regions of size 11-25 
probes, randomly assigned gain or loss, are inserted into 
larger chromosomes. A noise component N{0, a) is added 
to the theoretical log-ratios -1,0,0.58 (loss, neutral, and 
gain respectively) to model the data. For a set of noise 
levels, a ranging from 0.1 to 0.5, 50 synthetic datasets are 
generated. We use a 3-state HMM with width parameter 
values in the range 0<J D , 4?o D (where o D is the standard 
deviation of the dataset). Signal-to-noise ratio (SNR) is 
0 58 

defined as — — . In Figure 8 we plot the mean compres- 
o 

sion ratio, Fl-measure, recall, and precision for 50 syn- 
thetic datasets and the real biological data HBL-2. For all 
noise levels the compression ratio drops exponentially 
with increasing values of w. Ideally, one would like to 
compress as much as possible without affecting the quality 
of the predictions. We visually identified a best value for 
width as the point after which the Fl-measure drops sub- 
stantially. Comparing the knee of the curve with the best 
value, we notice that while using the knee computed by L- 
method [27] is a conservative choice for width, in most 
cases we can still obtain a considerable speed-up. 
Outliers 

Gaussian HMMs are known to be sensitive to outliers 
which is evident from our results of HBL-2 and SNP 
arrays. Traditionally, outliers have been handled either by 



using a mixture distribution as the emission distribution 
or by preprocessing the data to remove possible outliers 
or impute more appropriate values. We have observed 
that a simple local median approach works very well to 
identify the outliers in a time series of log 2 -ratio values. 
Although using a mixture distribution or a distribution 
with fat tails, i.e. Student's-t distribution, is a better 
choice we lose a significant computational advantage in 
approximate sampling. For a block of observations o' - 

k 

Oi, we can compute T\Pi°)W'@) m constant time 

H 

k k 

using precomputed values J2 °j an d J2 °f if the emission 

j=l j=i 

distribution is Gaussian. But it is not obvious how we can 
accomplish this for a more complicated distribution. 
Another approach, which we prefer in this context, is to 
use a HMM state with a very wide Gaussian and low self- 
transition probability to model the outliers. We have 
observed very good performance in this way. However, as 
our primary focus is to compare FBG-sampling with 
approximate sampling we choose to use a simple Gaus- 
sian model at the end. 

Conclusions 

Analyzing CGH data either from DNA microarrays or 
next generation sequencing to estimate chromosomal 
aberrations or investigate copy number variations 
(CNV), leads to the problem of segmenting sequences of 
observations which are essentially noisy versions of pie- 
cewise-constant functions. For reasons of efficiency, ML 
or MAP point estimates of HMM parameters combined 
with the Viterbi-algorithm to compute a most likely 
sequence of hidden states and thus a segmentation of 
the input are most popular in practice. This ignores 
research which clearly demonstrates that Bayesian 
approaches, where MCMC is used for sampling and 
thus integrating out model parameters, is more robust 
with higher recall and higher precision [24]. Addition- 
ally, our experiments show that likelihood is not infor- 
mative with respect to the quality of CNV calls putting 
the use of ML into question even if the estimation pro- 
blem could be solved. 

We propose a method using approximate sampling to 
accelerate MCMC for HMMs on such data. Our method 
constitutes the first use of ideas from spatial index struc- 
tures for several consecutive observations and approxi- 
mate computations based on geometric arguments for 
HMM; the effectiveness of this approach was previously 
demonstrated for /c-means clustering, mixture estimation, 
and fast evaluation of a mixture of Gaussians. 

We demonstrate that these very abundant biological 
CGH datasets, which measure chromosomal aberrations 
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T 
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showing best width is selected by visual inspection. 



and copy number variations, are consistent with our 
assumptions of piece-wise constant ground truths, and 
we are able to achieve speed-ups between 10 and 60 
respectively 90, on these biological datasets while 



maintaining competitive prediction accuracy compared 
to the state-of-the-art. As datasets with even higher 
resolution, both from higher density DNA microarrays 
and next generation sequencing, become available, we 
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believe that the need for precise and efficient MCMC 
techniques will increase. The added precision over pop- 
ular ML/MAP-based methods is of particular biological 
relevance as for complex neurodegenerative diseases 
such as Autism de-novo copy number variations have 
recently been shown to play a role [37]; a precise and 
quick analysis on large collectives of patients is 
desirable. 

Applying approximate sampling to multi-dimensional 
observations-to jointly analyze data sets for recurrent 
CNVs [38] instead of analyzing individuals and post- 
processing results-and considering more complicated 
HMM topologies and observation densities are direc- 
tions for our future work. 

Appendix 

LCIXIITIJI. 1 Let O x ~ x =0\,...,0i-\,d = 0i, ...,Ou n -\,d min =mmo\, o' max = maxoud = mm \fij — fj^kl 

, PfalO'" 1 ) 

and A— < a . Assuming there exists a state 

P{q l =s x \0^) ~ * 

s x s.t. t = mm \o min , o max j > o, we can 

< a {{i+rc) n - 1 + (n- \) C ^{i + r) n -% where 



E Pfa q Hn -i,o'\0'- 1 ) 

^ow that ffiT..,^,^ 
r= — - and 



C = e 2a 2 



Proof. Using the assumption on x, for any position i < 
I < i + n - 1, we can argue that, 



1 / 0j-fl qi \ 2 

1 (°i ~ /V 



< e- 



(1) 



1 \{qi = s Xf 
dr 

e cr 2 otherwise. 



For any partial state path q b . . ., 

P{q i ,...,q i+n - 1 ,o'\O i - 1 ) 
= P{q l \O l - 1 )P{o l \q u 0 1 - 1 ) 



n a wk + i p (ok + i\<ik + i) 

1 I0 { - fl qi 



k=i 



p{q x \a- 1 ) 



(2) 



_l/0k + l ~ Hq k+l 

i+n—2 9 

ne z 

k=i 



We partition S n , the set of all possible partial 
state paths of length n, into N subsets S Sl . . . S Sn 

SUCh that, S s ' = [SeS n : (V Sl¥Sj C(S,s,) > C(S,si)) V ((V slf r Sj C(S, sj) > C(S,s,)) A Si = S;)} 

for 1 < j < N, where C ( S ' 5 ) = E 1 We 
again partition S s i = U^J^S^ such that, 

^{SeS s i:(^(S^Si + i)j=fe}. 

The size of S n can be expressed in terms of total num- 
ber of non-self-transitions present in a path, 



|S n | =N n 



n-l 



k=0 



(3) 



As the sets S 5 i are equal sized partitions of S n , 
\S Sj \ = "jl ("^ (N- Also notice that, by 

definition, the partial state paths in <S W with exactly /< 
number of non-self-transitions are equally distributed 
among the subsets S s i . As a result, 

$1 = ( n fe 1 )( N " 1 ) fe - 

Now we define S [s] = {{q b ^ +w _i) : = ... = q Un ^ 
= s)}. For the remaining part of the proof, if Y is a set 
of partial state paths, we use P(Y, o'|0' _1 ) in place of 

e m ^n-i^'io- 1 ) forclarit 

(<ji,...,(/, t „_i)eY 



E ^t H ,-i,o , |o < - 1 ) 

(* <? l+ „-i)eS" 



4; + „_l =S,0'|O'- 1 ) 



(4) 



En* = - 

seS 

P{S n f o , \O i - 1 ) 

~~ ^p^Wio*- 1 ) 

seS 

P{S n f o , \O i - 1 ) 
C PfSHo'lO 1 '- 1 ) 

, N | P(S^ / 0 , |O < - 1 ) 
= Up( S [ 5x ] /0 /| O i-i)' 



Now we derive an upper bound of the contribution 
from state paths in S 5 *. In the following equations we 
make use of the fact that a state path with k non-self- 

k 

transitions goes through at least - non-s^ states. 
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PtSSo'lO 1 - 1 ) 
P(SN / o / |O l - 1 ) 

E E PC^c/IO 4 - 1 ) 

= p^Ho'lo*'- 1 ) 



EE 



P(S, o'lO 1 '" 1 ) 
P(S^] / 0 / |O 1 '- 1 ) 



E E 



fe=o 



s € si 



3 — t/i/ • • • / t/i+n— 1 



,j-iv V V20 7 / 



n 

n-l 

= E E 

S € S : 



fc=0 



PtolO" 1 ) ' + fr ^ 
P(S,|0-i) I f a SA 



S — Cji, . . . , Cji +n —\ 



n 



..-( : 



S e 



5 — Cji/ • • • i Cji+n—1 

fOj ~ ^ 

e 



n 



]=i 



( 0) ~ f^s x 



k 

dx \ 2 



e o^ 



2 1 



fe=0 
n-l 



fe 

dr \ 2 
2 I 



fc=0 v 7 

+rc)"" 1 . 



(tr)* 



(5) 



Similarly, we derive an upper bound of the contribu- 
tion from state paths in S s y , where 1 < y * x < N. Now 
we use the fact that, because of the pigeonhole principle 

n 

any state path in S s y has to go through at least — non- 



N 



s^ states. 

P^o'lO'" 1 ) 
PfS^WlO''- 1 ) 

n-l 

*E E 



fc+n-1 ^ 2 V O- 



/ 1-t \ 'T? g g / 

a \{N-i)t) 11 l p-^ y 



n 



0 or 



n 

dt\~jsj 
~~ 2 I 



n 

^2 I 



fc=0 
n-l 
fc=0 



n - l\ /I - t 
fe 



0 O" 



n 

dz \~fj 



2n 



E 

fe=0 



(6) 



In 

= ac~M "(1 +r)"- 1 . 
Applying (5) and (6) in (4) we get, 

£ P(<7< ^n-l^'IO^ 1 ) 

(qi,...,q i+n -i)eS n 

E = = ■ ■ ■ = 4i + n-l = ^ O'lO 1 '" 1 ) 

5GS 

2n 

< a((l + rcf- 1 + (N - l)c N (l + rf" 1 ). 

Note: For simplicity of the notation, we follow the 
convention that /x Xo = — oo and M* N+1 = oo so that the 
proof holds for x = 1 or x = N. 
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